-
Frontiers in Genetics 2021
Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing statistical methods that have been developed for microarray data cannot be directly applied to RNA-seq data. Existing statistical methods usually model RNA-seq data by a discrete distribution, such as the Poisson, the negative binomial, or the mixture distribution with a point mass at zero and a Poisson distribution to further allow for data with an excess of zeros. Consequently, analytic tools corresponding to the above three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, it is unclear what the real distributions would be for these classifications when applied to a new and real dataset. Considering that count datasets are frequently characterized by excess zeros and overdispersion, this paper extends the existing distribution to a mixture distribution with a point mass at zero and a negative binomial distribution and proposes a zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification. More importantly, we compare the above four classification methods from the perspective of model parameters, as an understanding of parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we determine that the above four methods could transform into each other in some cases. Using simulation studies, we compare and evaluate the performance of these classification methods in a wide range of settings, and we also present a decision tree model created to help us select the optimal classifier for a new RNA-seq dataset. 
The results on the two real datasets agree with the theoretical and simulation results. The methods used in this work are implemented in open-source R scripts, with source code freely available at https://github.com/FocusPaka/ZINBLDA.
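To make the relationship between the four count models concrete, the following is a minimal sketch of a zero-inflated negative binomial probability mass function. It is not taken from the authors' R scripts. The parametrization (mean `mu`, dispersion `phi` with variance `mu + phi * mu**2`, zero-inflation probability `pi0`) and the function names are assumptions made for illustration.

```python
import math


def nb_pmf(y, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi >= 0.

    Assumed parametrization: variance = mu + phi * mu**2.
    phi == 0 is treated as the Poisson limit.
    """
    if phi == 0:
        return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))
    r = 1.0 / phi  # "size" parameter
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, phi, pi0):
    """Mixture of a point mass at zero (weight pi0) and a negative binomial."""
    base = (1.0 - pi0) * nb_pmf(y, mu, phi)
    return pi0 + base if y == 0 else base


# Special cases under this parametrization:
#   pi0 = 0            -> negative binomial
#   phi = 0            -> zero-inflated Poisson
#   pi0 = 0, phi = 0   -> Poisson
```

Under these assumptions, setting `pi0` or `phi` to zero recovers the simpler distributions, which is one way to read the statement that the four methods can transform into each other in some cases.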
PubMed: 33747051
DOI: 10.3389/fgene.2021.642227 -
Genetic Epidemiology Feb 2022
Count data with excessive zeros are increasingly ubiquitous in genetic association studies, such as neuritic plaques in brain pathology for Alzheimer's disease. Here, we developed gene-based association tests to model such data by a mixture of two distributions, one for the structural zeros contributed by the Binomial distribution, and the other for the counts from the Poisson distribution. We derived the score statistics of the corresponding parameter of the rare variants in the zero-inflated Poisson regression model, and then constructed burden (ZIP-b) and kernel (ZIP-k) tests for the association tests. We evaluated omnibus tests that combined both ZIP-b and ZIP-k tests. Through simulated sequence data, we illustrated the potential power gain of our proposed method over a two-stage method that analyzes binary and non-zero continuous data separately for both burden and kernel tests. The ZIP burden test outperformed the kernel test as expected in all scenarios except for the scenario of variants with a mixture of directions in the genetic effects. We further demonstrated its applications to analyses of the neuritic plaque data in the ROSMAP cohort. We expect our proposed test to be useful in practice as more powerful than or complementary to the two-stage method.
Topics: Binomial Distribution; Humans; Models, Genetic; Models, Statistical; Phenotype; Poisson Distribution
PubMed: 34779034
DOI: 10.1002/gepi.22438 -
BMC Medical Research Methodology Jan 2022
BACKGROUND
We consider cluster size data of SARS-CoV-2 transmissions for a number of different settings from recently published data. The statistical characteristics of superspreading events are commonly described by fitting a negative binomial distribution to secondary infection and cluster size data. It is used as an alternative to the Poisson distribution because it has a longer tail, with emphasis given to the value of the extra parameter, which allows the variance to be greater than the mean. Here we investigate whether other long-tailed distributions from more general extended Poisson process modelling can better describe the distribution of cluster sizes for SARS-CoV-2 transmissions.
METHODS
We use the extended Poisson process modelling (EPPM) approach with nested sets of models that include the Poisson and negative binomial distributions to assess the adequacy of models based on these standard distributions for the data considered.
RESULTS
We confirm the inadequacy of the Poisson distribution in most cases, and demonstrate the inadequacy of the negative binomial distribution in some cases.
CONCLUSIONS
The probability of a superspreading event may be underestimated by the negative binomial distribution, because EPPM distributions indicate much larger tail probabilities than negative binomial alternatives. We show that, of the settings considered, the large shared accommodation, meal, and work settings have the potential for more severe superspreading events than would be predicted by a negative binomial distribution. Therefore, public health efforts to prevent transmission in such settings should be prioritised.
Topics: Binomial Distribution; COVID-19; Humans; Pandemics; Poisson Distribution; SARS-CoV-2
PubMed: 35094680
DOI: 10.1186/s12874-022-01517-9 -
Scientific Reports Dec 2020
This paper presents a study of early epidemiological assessment of COVID-19 transmission dynamics in Indonesia. The aim is to quantify heterogeneity in the numbers of secondary infections. To this end, we estimate the basic reproduction number and the overdispersion parameter in two regions of Indonesia: Jakarta-Depok and Batam. The basic reproduction number is estimated with a sequential Bayesian method, while the overdispersion parameter is estimated by fitting a negative binomial distribution to the secondary case data. Based on the first 1288 confirmed cases collected from both regions, we find a high degree of individual-level variation in the transmission. The basic reproduction number is estimated at 6.79 and 2.47, while the overdispersion parameter of the negative binomial distribution is estimated at 0.06 and 0.2, for Jakarta-Depok and Batam, respectively. This suggests that superspreading events played a key role in the early stage of the outbreak, i.e., a small number of infected individuals were responsible for a large number of COVID-19 transmissions. This finding can be used to determine effective public measures, such as rapid isolation and identification, which are critical since delay of diagnosis is the most common cause of superspreading events.
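As an illustration of what these estimates imply, the sketch below evaluates a negative binomial offspring distribution at the reported values. It assumes the common mean-dispersion parametrization (mean `r0`, dispersion `k`, so that P(Z = 0) = (1 + r0/k)^(-k)); the abstract does not confirm that this exact form was used, and the function names are hypothetical.

```python
import math


def nb_offspring_pmf(z, r0, k):
    """P(Z = z) for a negative binomial offspring distribution.

    Assumed parametrization: mean r0, dispersion k (smaller k means
    more individual-level variation in transmission).
    """
    log_p = (
        math.lgamma(z + k) - math.lgamma(k) - math.lgamma(z + 1)
        + k * math.log(k / (k + r0))
        + z * math.log(r0 / (k + r0))
    )
    return math.exp(log_p)


def prob_no_transmission(r0, k):
    """Closed form for P(Z = 0) under the same parametrization."""
    return (1.0 + r0 / k) ** (-k)


# Estimates quoted in the abstract above.
for region, r0, k in [("Jakarta-Depok", 6.79, 0.06), ("Batam", 2.47, 0.2)]:
    p0 = prob_no_transmission(r0, k)
    p_ge10 = 1.0 - sum(nb_offspring_pmf(z, r0, k) for z in range(10))
    print(f"{region}: P(Z=0) = {p0:.3f}, P(Z>=10) = {p_ge10:.3f}")
```

With a small dispersion parameter, most of the probability mass sits at zero secondary cases while a long tail remains, which is the pattern the abstract describes as superspreading.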
Topics: Basic Reproduction Number; COVID-19; Computer Simulation; Humans; Indonesia; Models, Biological; SARS-CoV-2
PubMed: 33372191
DOI: 10.1038/s41598-020-79352-5 -
Computer Methods and Programs in... Oct 2021
Review
BACKGROUND AND OBJECTIVE
Our goal is to provide an overall strategy for utilizing continuous accelerated life models in the discrete setting that provides a unique and flexible modeling approach across a variety of hazard shapes.
METHODS
We convert well-known continuous accelerated life distributions into their discrete counterparts and show theoretically that existing software that accommodates left, right, and interval censoring in the continuous case is reusable in the discrete setting due to the structure of the likelihood equations.
RESULTS
We demonstrate across a variety of simulated and real-world data that our modeling approach can accommodate discrete data that may be approximately symmetric, left-skewed, or right-skewed, overcoming the limitations of more traditional modeling approaches.
CONCLUSIONS
We illustrate both theoretically and through simulations that our approach for accommodating discrete failure time and count data is quite flexible. We demonstrate that the special case of the discrete Weibull model can readily accommodate truly Poisson distributed data and has a great degree of flexibility for non-Poisson distributed data.
Topics: Models, Statistical; Software; Survival Analysis
PubMed: 34469807
DOI: 10.1016/j.cmpb.2021.106337 -
Theoretical Ecology Mar 2019
Second-order statistics such as the variance and autocorrelation can be useful indicators of the stability of randomly perturbed systems, in some cases providing early warning of an impending, dramatic change in the system's dynamics. One specific application area of interest is the surveillance of infectious diseases. In the context of disease (re-)emergence, a goal could be to have an indicator that is informative of whether the system is approaching the epidemic threshold, a point beyond which a major outbreak becomes possible. Prior work in this area has provided some proof of this principle but has not analytically treated the effect of imperfect observation on the behavior of indicators. This work provides expected values for several moments of the number of reported cases, where reported cases follow a binomial or negative binomial distribution with a mean based on the number of deaths in a birth-death-immigration process over some reporting interval. The normalized second factorial moment and the decay time of the number of reported cases are two indicators that are insensitive to the reporting probability. Simulation is used to show how this insensitivity could be used to distinguish a trend of increased reporting from a trend of increased transmission. The simulation study also illustrates both the high variance of estimates and the possibility of reducing the variance by averaging over an ensemble of estimates from multiple time series.
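The insensitivity of the normalized second factorial moment to the reporting probability can be checked numerically in a simplified setting. The sketch below assumes binomial reporting only (each true case is reported independently with probability `p`) and uses an arbitrary, made-up distribution for the true counts rather than the birth-death-immigration process of the paper; all names are hypothetical.

```python
import math


def thin(pmf, p):
    """Exact pmf of reported cases when each of N true cases is
    reported independently with probability p (binomial thinning)."""
    out = [0.0] * len(pmf)
    for n, pn in enumerate(pmf):
        for x in range(n + 1):
            out[x] += pn * math.comb(n, x) * p**x * (1 - p) ** (n - x)
    return out


def mean(pmf):
    return sum(x * px for x, px in enumerate(pmf))


def normalized_second_factorial_moment(pmf):
    """E[X(X-1)] / E[X]^2 for a pmf given as a list indexed by count."""
    fact2 = sum(x * (x - 1) * px for x, px in enumerate(pmf))
    return fact2 / mean(pmf) ** 2


# Illustrative (made-up) distribution of true case counts 0..4.
true_pmf = [0.10, 0.20, 0.30, 0.25, 0.15]

for p in (1.0, 0.5, 0.1):
    reported = thin(true_pmf, p)
    print(p, mean(reported), normalized_second_factorial_moment(reported))
```

Under thinning, E[X] scales by `p` and E[X(X-1)] by `p**2`, so the ratio is unchanged while the mean is not. This is the property that lets the indicator separate a change in reporting from a change in transmission in this simplified case.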
PubMed: 34552670
DOI: 10.1007/s12080-018-0390-3 -
NeuroImage Nov 2016
Permutation tests are increasingly being used as a reliable method for inference in neuroimaging analysis. However, they are computationally intensive. For small, non-imaging datasets, recomputing a model thousands of times is seldom a problem, but for large, complex models this can be prohibitively slow, even with the availability of inexpensive computing power. Here we exploit properties of statistics used with the general linear model (GLM) and their distributions to obtain accelerations irrespective of generic software or hardware improvements. We compare the following approaches: (i) performing a small number of permutations; (ii) estimating the p-value as a parameter of a negative binomial distribution; (iii) fitting a generalised Pareto distribution to the tail of the permutation distribution; (iv) computing p-values based on the expected moments of the permutation distribution, approximated from a gamma distribution; (v) direct fitting of a gamma distribution to the empirical permutation distribution; and (vi) permuting a reduced number of voxels, with completion of the remainder using low rank matrix theory. Using synthetic data we assessed the different methods in terms of their error rates, power, agreement with a reference result, and the risk of taking a different decision regarding the rejection of the null hypotheses (known as the resampling risk). We also conducted a re-analysis of a voxel-based morphometry study as a real-data example. All methods yielded exact error rates. Likewise, power was similar across methods. Resampling risk was higher for methods (i), (iii) and (v). For comparable resampling risks, the method in which no permutations are done (iv) was the absolute fastest. All methods produced visually similar maps for the real data, with stronger effects being detected in the family-wise error rate corrected maps by (iii) and (v), and generally similar to the results seen in the reference set. 
Overall, for uncorrected p-values, method (iv) was found to be the best as long as symmetric errors can be assumed. In all other settings, including for familywise error corrected p-values, we recommend the tail approximation (iii). The methods considered are freely available in the tool PALM - Permutation Analysis of Linear Models.
Topics: Algorithms; Brain; Computer Simulation; Data Interpretation, Statistical; Humans; Image Enhancement; Image Interpretation, Computer-Assisted; Models, Statistical; Neuroimaging; Reproducibility of Results; Sensitivity and Specificity
PubMed: 27288322
DOI: 10.1016/j.neuroimage.2016.05.068 -
Bulletin of Mathematical Biology Apr 2016
The Anderson-May model of human parasite infections and specifically that for the intestinal worm Ascaris lumbricoides is reconsidered, with a view to deriving the observed characteristic negative binomial distribution which is frequently found in human communities. The means to obtaining this result lies in reformulating the continuous Anderson-May model as a stochastic process involving two essential populations, the density of mature worms in the gut, and the density of mature eggs in the environment. The resulting partial differential equation for the generating function of the joint probability distribution of eggs and worms can be partially solved in the appropriate limit where the worm lifetime is much greater than that of the mature eggs in the environment. Allowing for a mean field nonlinearity, and for egg immigration from neighbouring communities, a negative binomial worm distribution can be predicted, whose parameters are determined by those in the continuous Anderson-May model; this result assumes no variability in predisposition to the infection.
Topics: Animals; Ascariasis; Ascaris lumbricoides; Binomial Distribution; Digestive System; Humans; Mathematical Concepts; Models, Biological; Nonlinear Dynamics; Parasite Egg Count; Stochastic Processes
PubMed: 27066982
DOI: 10.1007/s11538-016-0164-2 -
PloS One 2021
Disease mapping aims to determine the underlying disease risk from scattered epidemiological data and to represent it on a smoothed colored map. This methodology is based on Bayesian inference and is classically dedicated to non-infectious diseases whose incidence is low and whose case distribution is spatially (and possibly temporally) structured. Over the last decades, disease mapping has received many major improvements to extend its scope of application: integrating the temporal dimension, dealing with missing data, taking into account various kinds of prior information (environmental and population covariates, assumptions concerning the distribution and the evolution of the risk), dealing with overdispersion, etc. We aim to adapt this approach to model rare infectious diseases by proposing specific and generic variants of this methodology. In the context of a contagious disease, a primary case can in addition generate secondary occurrences of the pathology in a close spatial and temporal neighborhood; this can result in local overdispersion and in higher spatial and temporal dependencies due to direct and/or indirect transmission. Consequently, we test models including a negative binomial distribution (instead of the usual Poisson distribution) to deal with local overdispersion. We also use a specific spatio-temporal link in order to better model the stronger spatial and temporal dependencies due to the transmission of the disease. We have proposed and tested 60 Bayesian hierarchical models on 400 simulated datasets and bovine tuberculosis real data. This analysis shows the relevance of the CAR (Conditional AutoRegressive) processes to deal with the structure of the risk. We can also conclude that the negative binomial models outperform the Poisson models with Gaussian noise in handling overdispersion.
In addition, our study provided relevant maps that are congruent with the real risk (simulated data) and with the knowledge concerning bovine tuberculosis (real data).
Topics: Animals; Bayes Theorem; Binomial Distribution; Cattle; Disease; Humans; Incidence; Models, Statistical; Poisson Distribution; Tuberculosis, Bovine
PubMed: 33439868
DOI: 10.1371/journal.pone.0222898 -
Frontiers in Psychology 2017
Review
Statistical analysis is crucial for research, and the choice of analytical technique should take into account the specific distribution of data. Although the data obtained from health, educational, and social sciences research are often not normally distributed, there are very few studies detailing which distributions are most likely to represent data in these disciplines. The aim of this systematic review was to determine the frequency of appearance of the most common non-normal distributions in the health, educational, and social sciences. The search was carried out in the Web of Science database, from which we retrieved the abstracts of papers published between 2010 and 2015. The selection was made on the basis of the title and the abstract, and was performed independently by two reviewers. The inter-rater reliability for article selection was high (Cohen's kappa = 0.84), and agreement regarding the type of distribution reached 96.5%. A total of 262 abstracts were included in the final review. The distribution of the response variable was reported in 231 of these abstracts, while in the remaining 31 it was merely stated that the distribution was non-normal. In terms of their frequency of appearance, the most common non-normal distributions can be ranked in descending order as follows: gamma, negative binomial, multinomial, binomial, lognormal, and exponential. In addition to identifying the distributions most commonly used in empirical studies, these results will help researchers to decide which distributions should be included in simulation studies examining statistical procedures.
PubMed: 28959227
DOI: 10.3389/fpsyg.2017.01602